Agenda

1. Introduction to performance analysis with ARM® DS-5 and Streamline Performance Analyzer

2. Software Profiling

Find hotspots, system glitches, critical conditions at a glance
Power measurements

3. GPU Profiling

Using the ARM® Mali™ GPU hardware counters to find the bottleneck

4. Debugging with Mali Graphics Debugger

Overdraw and frame analysis

5. Q&A


Importance of Analysis & Debug 


■ Mobile Platforms 

Expectation of amazing console-like graphics and playing experience 

■ Screen resolution beyond HD 

■ Limited power budget 

■ Solution 

■ ARM® Cortex® CPUs and Mali™ GPUs are designed for low power 
whilst providing innovative features to keep up performance 

Software developers can be “smart” when developing apps 

■ Good tools can do the heavy lifting 






Performance Analysis & Debug 



ARM® DS-5 Streamline 
Performance Analyzer 

• System-wide performance analysis 

• Combined ARM Cortex® 
Processors and Mali™ GPU visibility 

• Optimize for performance & power 
across the system 



ARM Mali Graphics Debugger 

• API Trace & Debug Tool 

• Understand graphics and compute 
issues at the API level 

• Debug and improve performance 
at frame level 

• Support for OpenGL® ES 1.1, 2.0,
3.0 and OpenCL™ 1.1



Offline Compilers 

• Understand complexity of GLSL 
shaders and CL kernels 

• Support for ARM Mali-4xx and 
Mali-T6xx GPU families 




ARM® DS-5 Streamline Performance Analyzer 


■ System-wide Performance Analysis

Simultaneous visibility across ARM Cortex® processors & 
Mali™ GPUs 

Support for graphics and GPU Compute performance 
analysis on Mali-T600 series 

Timeline profiling of hardware counters for detailed analysis 

■ Custom counters 

Per-core/thread/process granularity

■ Frame buffer capture and display 

■ Optimize 
Performance
Energy efficiency 
Across the system 




The Basics 


■ Software based solution 

ICE/trace units not required 

Support for Linux kernel 2.6.32+ on target 

Eclipse plug-in or command line 

■ Lightweight sample profiling 

■ Time- or event*-based sampling 

Process to C/C++ source code profiler 
Low probe effect; <5% typically 

■ Multiple data sources 

CPU, GPU and Interconnect hardware counters 
Software counters and kernel tracepoints 
User defined counters and instrumented code 
Power/energy measurements 
* Event-based sampling is available on kernels 3.0 or later


[Figure: target device software stack, with labels User Space, Applications & Middleware, OpenGL® ES, ARM® Mali™ GPU Drivers, gator Daemon, gator Driver, Linux Kernel, TCP/IP, ARM Processor]

Timeline: The Big Picture

Find hotspots, system glitches, critical conditions at a glance 


Select from 40+ CPU counters, 
OS level and custom metrics 




Select one or more processes to 
visualize their instant load on CPU 



Filter processes (regex) 


Accumulate counters, measure time 
and find instant hotspots 


[libdvm.so]
[idle]
[libskia.so]
[kernel]
[libz.so]
[libc.so]
[libcutils.so]
[libGLESv2_mali.so]
[libhwui.so]
[libMali.so]





SMP Analysis 

Take advantage of multicore SMP platforms 

■ Visually trace core migration and per-core statistics 

Spot non-optimal thread synchronization and improve parallelism 











Drilldown Software Profiling 



Click on the function name 
to go to source code level profile 


Filter timeline data to generate 
focused software profile reports 


[Screenshot: Streamline Code view of xaos-3.5/src/engine/docalc.c, lines 540 to 553 (alternating "if (PCHECK) goto periodicity;" and "FORMULA;"), with per-line sample percentages between 2.13% and 8.51%. The lower pane shows the matching VFP disassembly (VADD.F32, VSUB.F32, VMOV.F32, VMLA.F32, VABS.F32, VCMPE.F32, VMRS, VMUL.F32, BPL) with per-instruction samples. A warning notes that the source file is newer than the binary.]

Bottom-Up Shared Library Analysis

[Screenshot: Functions view listing sample share and instance count per shared library, including [idle], [libdvm.so], [kernel], [libc.so], [libz.so], [libMali.so], [libGLESv1_CM_mali.so], [libcutils.so], [libutils.so], [gator], [libhwui.so], [libGLESv2_mali.so], [libbinder.so] and [libcrypto.so]. A context menu offers Top Call Paths, Select Process/Thread in Timeline, Select in Call Paths, Select in Code, Select in Call Graph and Select in Stack.]

Processes or call paths using it 
will be automatically highlighted 


Select the library or function to look 
into, then navigate to Call Paths 
or Timeline 


[Screenshot: Timeline view with CPU Activity, Clock and Instruction charts, per-process bars, and a Samples list per shared library]
Power Measurement Interfaces

Data Acquisition ARM DS-5 Streamline Performance Analyzer 


ARM® Energy Probe 

• 3-channel 

• System-level analysis 

• Easy to deploy 

• Affordable 



Good for trend spotting and 
application optimization 


NI DAQ USB-62xx

• 40+ analog inputs 

• Subcomponent sensitivity 

• High fidelity 

• Higher cost 



Good for OS power management tuning and benchmarking 







The Power of Having It All in One Place 

How effective are you at managing your energy budget? 




How long does the power manager take to 
respond to changes in CPU load? 


Monitor instant voltage, current and 
power per channel 








Community Edition vs. Professional Edition 


Which is the right 
version for you? 




[Diagram: target audiences per edition, with labels BSP / Distribution Makers and OEMs / ODMs]


Typical Use Case 
Program Images 
Timeline View 

* Performance Charts 

* Process Bars 

* ARM® Mali™ GPU Analysis 

* Quick Profile Summary 

* Core Affinity Mode 

* Energy Probe data capture 

* Time Filtering 

* Annotation 
Call Paths View 
Functions View 
CodeView 
Call Graph 
Stack View 
Log View 
Command Line 
Event Based Sampling 


Community

Typical use case: simple application profiling
Program images: 1

Professional

Typical use case: system-wide, SMP analysis
Program images: limited to host memory
























Main Bottlenecks 


■ CPU 

■ Too many draw calls 
Complex physics 

■ Vertex processing 

■ Too many vertices 

Too much computation per vertex 

■ Fragment processing 

Too many fragments, overdraw 
Too much computation per fragment 

■ Bandwidth 

Big and uncompressed textures 

■ High resolution framebuffer 

■ Battery life 

Energy consumption strongly affects User 
Experience 
[Figure: graphics pipeline data flow between CPU, memory, vertex shader and fragment shader, with labels Vertices, Textures, Uniforms, Triangles and Varyings]




ARM® Mali™ Graphics Debugger 



Graphics debugging for content developers
API level tracing

Understand issues and causes at frame level

Support for OpenGL® ES 2.0, 3.0, EGL™ &
OpenCL™ 1.1

Complementary to DS-5 Streamline

v1.2.2 released in February
v1.3 will be available soon



Investigation with the ARM® Mali™ Graphics Debugger 


[Screenshot: Mali Graphics Debugger main window. The Outline lists frames with their draw call, vertex and unique index counts (for example Frame 17: 642 draws, 940520 vertices; Frame 33: 230 draws, 383256 vertices) and the glDrawElements calls inside the selected frame. The Assets view groups GLES buffers, framebuffers, programs, shaders and textures, plus EGL and CL assets. The trace pane lists API calls with their error codes, for example glViewport(x=0, y=0, width=2560, height=1504), glDepthRangef(n=0.0, f=1.0), glEnable(cap=GL_DEPTH_TEST), glDisable(cap=GL_SCISSOR_TEST) and glDisable(cap=GL_BLEND), all returning GL_NO_ERROR.]


Trace Analysis

Message                                                                     Count
Indices buffer may be too sparse (total sparseness: 3.16)                   16680
Vertex attribute 'pointer' less than zero                                   1454
Unexpected buffer size (was 4 but was expecting 2)                          4
Unexpected buffer type (was GL_HALF_FLOAT_OES but was expecting GL_FLOAT)   259
Current program is not set                                                  38
Unexpected buffer type (was GL_UNSIGNED_BYTE but was expecting GL_FLOAT)    4


[Screenshot panels: Statistics (total number of API function calls, average vertices per frame, average instanced vertices per frame, average draws per frame, and per-frame and per-draw-call counts) and Target State (GL state items such as GL_BLEND_EQUATION_ALPHA, GL_BLEND_SRC_RGB, GL_COLOR_CLEAR_VALUE, GL_CULL_FACE and GL_DEPTH_FUNC with their current values)]


Vertex Shaders

Program  Name        Instructions  Shortest path  Longest path  Instances  Total cycles
280      Shader 281  39            39             39            213174     8313786
301      Shader 302  38            38             38            41133      1563054
298      Shader 299  38            38             38            20400      775200
442      Shader 443  45            45             45            13485      606825
175      Shader 176  22            22             22            22128      486816
289      Shader 290  55            55             55            7962       437910
313      Shader 314  39            39             39            11199      436761
286      Shader 287  70            70             70            5535       387450
211      Shader 212  22            22             22            14646      322212
199      Shader 200  22            22             22            13008      286176
292      Shader 293  55            55             55            4164       229020
262      Shader 263  36            36             36            3348       120528
187      Shader 188  48            48             48            2484       119232
181      Shader 182  61            61             61            1767       107787
316      Shader 317  70            70             70            1110       77700
214      Shader 215  61            61             61            1110       67710

Epic Citadel



[Screenshot: Streamline timeline with CPU Activity, GPU Fragment, GPU Vertex-Tiling-Com..., Mali Arithmetic Pipe, Mali Core Cycles and Mali Load/Store Pipe charts, above per-process bars for [idle], [kernel], [surfaceflinger #129], [mediaserver #132], [adbd #139], [system_server #465], [com.epicgames.EpicCitadel #1932] and [com.android.vending #1109]]


Profiling via ARM® DS-5 Development Studio 


■ DS-5 Streamline to capture data 

■ Google Nexus 10, Android™ 4.4 

■ Dual core ARM Cortex®-A15, Mali™-T604

■ Low CPU activity (CPU Activity -> User)
that averages 24% over one second

■ Burst in GPU activity: 99% utilization
(GPU Fragment -> Activity)

■ While rendering the most complicated 
scene, the application is capable of 36 
fps (29ms/frame) 





The Application is GPU bound 

The CPU has to wait until the fragment processing has finished 


[Screenshot: CPU Activity, GPU Fragment and GPU Vertex-Tiling-Com... charts]






ARM® Mali™ GPU Hardware Counters 


■ Over the highlighted time of one second
the GPU was active for 448m cycles
(Mali Job Manager Cycles -> GPU cycles)

■ With this hardware, the maximum
number of cycles is 450m

■ A first pass of optimization would lead
to a higher frame rate

■ After reaching V-SYNC, optimization can
lead to energy savings and a longer
play time
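
The utilization implied by those two counter values can be checked with a few lines of Python. This is only a sketch: the function name is made up for illustration, and the 448m and 450m figures are the ones quoted on the slide for a one-second window.

```python
def gpu_utilization(active_cycles: float, max_cycles: float) -> float:
    """Return the fraction of available GPU cycles that were busy."""
    if max_cycles <= 0:
        raise ValueError("max_cycles must be positive")
    return active_cycles / max_cycles


# Figures quoted on the slide for a one-second window.
utilization = gpu_utilization(448e6, 450e6)
print(f"GPU utilization: {utilization:.1%}")
```

A value this close to 100% is what points to the GPU, rather than the CPU, as the limiting factor.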







Vertex and Fragment Processing 


The GPU is spending:

186m (29%) on vertex processing
(Mali Job Manager Cycles -> JS1 cycles)

448m (70%) on fragment processing
(Mali Job Manager Cycles -> JS0 cycles)

[Chart labels: Fragment Count Per Program; Fragment Cycles, Vertex Cycles, Setup work]


There might be an overhead in the job manager trying to optimize vertex list packing into 
jobs. 
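
The quoted percentages appear to be each job slot's share of the two slots combined, which is an assumption here rather than something the slide states. A minimal sketch of that calculation, with illustrative dictionary keys:

```python
def cycle_shares(counters: dict) -> dict:
    """Return each counter's share of the sum of all counters."""
    total = sum(counters.values())
    return {name: value / total for name, value in counters.items()}


# Cycle counts quoted on the slide (JS1 = vertex, JS0 = fragment).
shares = cycle_shares({"vertex": 186e6, "fragment": 448e6})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```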


ARM® Mali™-T628 GPU Tripipe Cycles 


Arithmetic instructions 

■ Math in the shaders 

Load & Store instructions 

■ Uniforms, attributes and varyings 

Texture instructions 

■ Texture sampling and filtering 


Instructions can run in parallel 
Each one can be a bottleneck 




[Figure: tripipe diagram with labels Thread Issue, Arithmetic Pipeline (two instances) and Thread Completion]


There are two arithmetic pipelines so 

we should aim to increase the arithmetic workload 



[Figure: Mali™-T628 block diagram with Inter-Core Task Management, Shader Cores, MMUs, Level 2 Caches and AMBA®4 ACE-Lite™ interfaces]





Inspect the Tripipe Counters 

Reduce the load on the L/S pipeline 



Texture: 197m

Arithmetic: 105m

Tripipe Counters 

Cycles per instruction metrics 

■ It’s easy to calculate a couple of CPI (cycles per instruction) metrics:


■ For the load/store pipeline we have:

408m (Mali Load/Store Pipe -> LS instruction issues)
/ 195m (Mali Load/Store Pipe -> LS instructions)

= 2.09 cycles/instruction

■ For the texture pipeline we have:

197m (Mali Texture Pipe -> T instruction issues)

/ 169m (Mali Texture Pipe -> T instructions)

= 1.16 cycles/instruction
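
The two ratios can be reproduced as below. This is a sketch with a hypothetical helper name; the texture instruction count is taken as 169m, the value consistent with the quoted result of about 1.16.

```python
def cycles_per_instruction(issue_cycles: float, instructions: float) -> float:
    """Divide instruction issue cycles by completed instructions."""
    if instructions <= 0:
        raise ValueError("instructions must be positive")
    return issue_cycles / instructions


# Counter values quoted on the slide.
ls_cpi = cycles_per_instruction(408e6, 195e6)
tex_cpi = cycles_per_instruction(197e6, 169e6)

print(f"Load/store pipe: {ls_cpi:.2f} cycles/instruction")
print(f"Texture pipe:    {tex_cpi:.2f} cycles/instruction")
```

A CPI well above 1 on the load/store pipe means a large share of its cycles are stalls, which is why that pipe is the first candidate for optimization.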


[Chart: share of stalls and instructions (0% to 100%) for the Load & Store Pipeline and the Texture Pipeline]



CPU Bound 


[Figure: pipeline diagram with CPU, Memory, Vertex Shader and Fragment Shader]



■ Mali GPU is a deferred architecture

Do not force a pipeline flush by reading
back data (glReadPixels, glFinish, etc.)

Reduce the number of draw calls

■ Try to combine your draw calls together

■ Offload some of the work to the GPU

■ Move physics from CPU to GPU

■ Avoid unnecessary OpenGL® ES calls
(glGetError, redundant state changes,
etc.)
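
One common way to avoid redundant state changes is to remember the last value set for each piece of state and skip the driver call when nothing changes. The sketch below illustrates the idea only: `StateCache` and its callback are hypothetical names, and the callback stands in for real calls such as glEnable or glDisable.

```python
from typing import Callable, Dict, Hashable, List, Tuple


class StateCache:
    """Forward a state change only when the value actually differs."""

    def __init__(self, apply: Callable[[Hashable, object], None]) -> None:
        self._apply = apply                       # stand-in for the real driver call
        self._current: Dict[Hashable, object] = {}
        self.issued = 0                           # calls that reached the driver
        self.skipped = 0                          # redundant calls filtered out

    def set(self, key: Hashable, value: object) -> None:
        if key in self._current and self._current[key] == value:
            self.skipped += 1
            return
        self._current[key] = value
        self._apply(key, value)
        self.issued += 1


driver_calls: List[Tuple[Hashable, object]] = []
cache = StateCache(lambda key, value: driver_calls.append((key, value)))

requests = [
    ("GL_BLEND", False),
    ("GL_DEPTH_TEST", True),
    ("GL_BLEND", False),       # redundant
    ("GL_DEPTH_TEST", True),   # redundant
    ("GL_BLEND", True),
]
for key, value in requests:
    cache.set(key, value)

print(f"issued={cache.issued} skipped={cache.skipped}")
```

The same bookkeeping applies to bound programs, textures and buffers; the trade-off is that the cache must be invalidated whenever code outside it touches the GL state.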
Synchronous Rendering

[Figure: CPU and GPU timelines for Frame 1 and Frame 2]

Deferred Rendering

[Figure: CPU and GPU timelines for Frame 1, Frame 2 and Frame 3]





Vertex Bound 


[Figure: pipeline diagram with CPU, Memory, Vertex Shader and Fragment Shader]



■ Get your artist to remove unnecessary 
vertices 

■ LOD switching 

■ Only objects near the camera need to be 
in high detail 

Use culling 

Too many cycles in the vertex shader 

Frame 33: 230 draws, 383256 vertices, 142835 unique indices

 1 glDrawElements: 1194 vertices, 256 unique indices
 2 glDrawElements: 918 vertices, 202 unique indices
 3 glDrawElements: 918 vertices, 202 unique indices
 4 glDrawElements: 336 vertices, 83 unique indices
 5 glDrawElements: 1839 vertices, 459 unique indices
 6 glDrawElements: 2802 vertices, 732 unique indices
 7 glDrawElements: 6243 vertices, 2499 unique indices
 8 glDrawElements: 2529 vertices, 545 unique indices
 9 glDrawElements: 312 vertices, 152 unique indices
10 glDrawElements: 54 vertices, 22 unique indices
11 glDrawElements: 1704 vertices, 392 unique indices
12 glDrawElements: 396 vertices, 194 unique indices
13 glDrawElements: 4038 vertices, 1124 unique indices
14 glDrawElements: 8220 vertices, 2198 unique indices
15 glDrawElements: 564 vertices, 291 unique indices
16 glDrawElements: 528 vertices, 233 unique indices
17 glDrawElements: 2166 vertices, 681 unique indices
18 glDrawElements: 3858 vertices, 2067 unique indices
19 glDrawElements: 702 vertices, 468 unique indices
20 glDrawElements: 1671 vertices, 808 unique indices
21 glDrawElements: 2322 vertices, 836 unique indices
22 glDrawElements: 2277 vertices, 917 unique indices
23 glDrawElements: 4251 vertices, 1131 unique indices



Vertex Count and Shader Optimizations 

Identify the top heavyweight vertex shaders 


[Screenshot: Mali Graphics Debugger trace citadel13c.mgd, Shader 176 source, excerpt below]

    float Square(float A)
    {
        return A * A;
    }
    vec2 Square(vec2 A)
    {
        return A * A;
    }
    vec3 Square(vec3 A)
    {
        return A * A;
    }
    vec4 Square(vec4 A)
    {
        return A * A;
    }
    float EncodeLightAttenuation(float InColor)
    {
        return sqrt(InColor);
    }
    vec4 EncodeLightAttenuation(vec4 InColor)
    {
        return sqrt(InColor);
    }
    uniform mat3 TextureTransform;
    uniform mat4 LocalToWorld;
    uniform mat3 LocalToWorldRotation;
    uniform mat4 ViewProjection;
    uniform mat4 LocalToProjection;
    uniform vec4 FadeColorAndAmount;


Vertex Shaders (first three rows: program and name not visible in the capture)

Program  Name        Instructions  Shortest path  Longest path  Instances  Total cycles
-        -           22            22             22            148500     3267000
-        -           39            39             39            80595      3143205
-        -           48            48             48            26328      1263744
181      Shader 182  61            61             61            11487      700707
211      Shader 212  22            22             22            14646      322212
289      Shader 290  55            55             55            5484       301620
298      Shader 299  38            38             38            5259       199842
208      Shader 209  22            22             22            4257       93654
214      Shader 215  61            61             61            1110       67710
73       Shader 74   20            20             20            2880       57600
262      Shader 263  36            36             36            1152       41472

[Chart: Vertex Cycles Per Program. Legend: Program 175, Program 280, Program 187, Others]



Fragment Bound 


[Figure: pipeline diagram with CPU, Memory, Vertex Shader and Fragment Shader]



■ Render to a smaller framebuffer 

■ Move computation from the fragment 
to the vertex shader (use HW 
interpolation) 

Drawing your objects front to back 
instead of back to front reduces 
overdraw 

■ Reduce the amount of transparency in 
the scene 


Overdraw 


This is when you draw to each pixel on the screen 
more than once 

Drawing your objects front to back instead of back 
to front reduces overdraw 

Limiting the amount of transparency in the scene 
can help 



Overdraw 


Overdraw Factor 


We divide the number of fragments by
the number of output pixels. Each
rendered fragment corresponds to one
fragment thread and each tile is 16x16
pixels, thus in our case:


90.7m (Mali Core Threads -> Fragment threads)

/ (143K (Mali Fragment Tasks -> Tiles rendered) x 256)
= 2.48 threads/pixel
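
The same calculation, written out as a small Python sketch. The function name is illustrative; the counter values and the 16x16 tile size are the ones given on the slide.

```python
TILE_PIXELS = 16 * 16  # each tile covers 16x16 pixels


def overdraw_factor(fragment_threads: float, tiles_rendered: float,
                    tile_pixels: int = TILE_PIXELS) -> float:
    """Fragment threads per output pixel over the measured interval."""
    output_pixels = tiles_rendered * tile_pixels
    return fragment_threads / output_pixels


# Counter values quoted on the slide.
factor = overdraw_factor(90.7e6, 143e3)
print(f"Overdraw factor: {factor:.2f} threads/pixel")
```

A factor of 1.0 would mean every pixel is shaded exactly once, so anything well above that is work that could in principle be avoided.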


[Screenshot: Streamline custom chart configuration with Title "Overdraw", series Name "Overdraw Factor" and an Expression beginning $MaliCoreThreadsFragm...]




Frame Capture 
Frame Analysis

Check the overdraw factor 



Draw 181, total vertices: 118594 



Shader Map and Fragment Count

Identify the top heavyweight fragment shaders 




[Chart: Fragment Count Per Program]

Fragment Shaders (one row: program and name not legible in the capture)

Program  Name        Instructions  Shortest  Longest  Instances  Total cycles
175      Shader 177  5             5         5        7537773    37688865
280      Shader 282  5             5         5        1459254    7296270
-        -           5             5         5        415710     2078550
187      Shader 189  6             6         6        197329     1183974
73       Shader 75   4             4         4        279555     1118220
382      Shader 384  8             8         8        129913     1039304
289      Shader 291  6             6         6        16856      101136
208      Shader 210  7             3         6        7975       39875
262      Shader 264  5             5         5        6025       30125
400      Shader 402  5             5         5        914        4570



~10m instances
/ (2560 x 1600) pixels
= 2.44
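
As a cross-check, the instance counts from the Fragment Shaders table can be summed and divided by the screen size. This is a sketch; summing the exact counts instead of rounding to 10m gives a slightly higher figure than the slide's 2.44.

```python
# Instances column of the Fragment Shaders table on the slide.
fragment_instances = [
    7537773, 1459254, 415710, 197329, 279555,
    129913, 16856, 7975, 6025, 914,
]

SCREEN_PIXELS = 2560 * 1600  # resolution used on the slide

total_instances = sum(fragment_instances)
instances_per_pixel = total_instances / SCREEN_PIXELS

print(f"Total fragment instances: {total_instances}")
print(f"Instances per pixel: {instances_per_pixel:.2f}")
```

Both the counter-based and the frame-based estimates land near 2.5, so the two tools agree on the amount of overdraw.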


[Chart legend: Program 175, Program 280, Program 181, Others]

[Frame capture caption: Draw 185, total vertices: 120380]


Shader Optimization 


■ Since the arithmetic workload is not
very big, we should reduce the number
of uniforms and varyings and calculate
them on-the-fly

■ Reduce their size

■ Reduce their precision: is highp always
necessary?

■ Use the Mali Offline Shader Compiler!

http://malideveloper.arm.com/develop-for-mali/tools/analysis-debug/mali-gpu-offline-shader-compiler/
Shader 177

    uniform sampler2D TextureBase;
    varying highp vec4 UVBase;
    uniform sampler2D TextureLightmap;
    varying highp vec2 UVLightmap;
    varying lowp vec4 GlobalEffectColorAndAmount;

    void main()
    {
        lowp vec3 DebugColor;
        highp vec2 FinalBaseUV = UVBase.xy;
        highp vec2 TransformedFinalBaseUV = UVBase.zw;
        highp vec2 BaseTextureCoord;
        BaseTextureCoord = FinalBaseUV;
        lowp vec4 BaseColor = texture2D(TextureBase, BaseTextureCoord, -0.50);
        lowp float AlphaVal = BaseColor.a;
        {
        }
        ALPHAKILL(AlphaVal)
        BaseColor.xyz = BaseColor.xyz;
        lowp vec4 PolyColor = vec4(BaseColor.xyz, AlphaVal);
        lowp vec3 EnvironmentSpecular = vec3(0, 0, 0);
        lowp vec3 TotalDiffuseLight = vec3(0.0, 0.0, 0.0);
        PolyColor.rgb += EnvironmentSpecular;
        lowp vec3 PreSpecularPolyColor = PolyColor.rgb;
        {
            lowp vec3 LightmapColor = texture2D(TextureLightmap, UVLightmap).rgb;
            LightmapColor = LightmapColor;
            PolyColor.rgb = PolyColor.rgb * LightmapColor;
            PolyColor.rgb += PolyColor.rgb;
        }
        PolyColor.xyz = (PolyColor.xyz * GlobalEffectColorAndAmount.w) + GlobalEffectColorAndAmou
        PolyColor.xyz = PolyColor.xyz;
        gl_FragColor = PolyColor;



Bandwidth Bound 


[Figure: pipeline diagram with CPU, Memory, Vertex Shader and Fragment Shader]






Bandwidth 


When creating embedded graphics
applications, bandwidth is a scarce
resource

A typical embedded device can handle 5.0
gigabytes per second of bandwidth

A typical desktop GPU can do in excess of
100 gigabytes per second


The application is not bandwidth bound as
it performs, over a period of one second:

(96m (Mali L2 Cache -> External read beats) +

90.7m (Mali L2 Cache -> External write beats)) x 16
~= 2.9 GB/s

■ Since bandwidth usage is related to energy
consumption, it’s always worth optimizing it
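
The bandwidth estimate can be reproduced as follows. This is a sketch: the factor of 16 is read here as bytes transferred per bus beat, which is an assumption about the slide's formula, and the function name is illustrative.

```python
BYTES_PER_BEAT = 16  # assumption: the slide's "x 16" is bytes per beat


def external_bandwidth_gb_per_s(read_beats: float, write_beats: float,
                                seconds: float = 1.0,
                                bytes_per_beat: int = BYTES_PER_BEAT) -> float:
    """External memory traffic in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = (read_beats + write_beats) * bytes_per_beat
    return total_bytes / seconds / 1e9


# Counter values quoted on the slide for a one-second window.
bandwidth = external_bandwidth_gb_per_s(96e6, 90.7e6)
print(f"External bandwidth: {bandwidth:.2f} GB/s")
```

The result sits comfortably below the 5.0 GB/s quoted for a typical embedded device, which is the basis for saying the application is not bandwidth bound.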


[Chart: Mali L2 Cache counters, with series External read beats, L2 read hits and L2 write hits]
Bandwidth Bound


Vertices 

■ Reduce the number of vertices and 
varyings 

■ Interleave vertices, normals, texture 
coordinates 

■ Use Vertex Buffer Objects 

Fragments 

■ Use texture compression 

■ Enable texture mipmapping 

This also improves cache
utilization.

[Screenshot: Mali Graphics Debugger buffer view listing per-vertex attribute values]

Indices sparseness: 1.47
bad for caching!



Textures

Save memory and bandwidth with texture compression 


The current most popular format is 
ETC Texture Compression 

■ But ASTC (Adaptive Scalable Texture
Compression) can deliver < 1 bit/pixel
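
To give a feel for the savings, the sketch below computes the size of a single 2048 x 2048 mip level in three formats. It relies on the block layouts of the formats (ETC1 stores each 4x4 block in 8 bytes, ASTC stores every block in 16 bytes whatever its footprint); the helper names are illustrative and mipmaps are ignored.

```python
import math


def uncompressed_bytes(width: int, height: int, bytes_per_pixel: int) -> int:
    """Size of one uncompressed mip level."""
    return width * height * bytes_per_pixel


def block_compressed_bytes(width: int, height: int,
                           block_w: int, block_h: int, block_bytes: int) -> int:
    """Size of one mip level stored as fixed-size compressed blocks."""
    blocks_x = math.ceil(width / block_w)
    blocks_y = math.ceil(height / block_h)
    return blocks_x * blocks_y * block_bytes


W, H = 2048, 2048
rgba8 = uncompressed_bytes(W, H, 4)                # GL_RGBA, GL_UNSIGNED_BYTE
etc1 = block_compressed_bytes(W, H, 4, 4, 8)       # 4 bits per pixel
astc_5x5 = block_compressed_bytes(W, H, 5, 5, 16)  # about 5.12 bits per pixel

# ASTC's largest 2D block footprint, 12x12, is what gets below 1 bit per pixel.
astc_12x12_bpp = 128 / (12 * 12)

for name, size in [("RGBA8", rgba8), ("ETC1", etc1), ("ASTC 5x5", astc_5x5)]:
    print(f"{name}: {size / (1024 * 1024):.2f} MiB")
print(f"ASTC 12x12: {astc_12x12_bpp:.2f} bits/pixel")
```

The ASTC block footprint is a quality and size trade-off chosen per texture, so 5x5 and 12x12 are two points on a range rather than fixed settings.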


[Chart: Texture Weight by Dimension (Uncompressed RGBA). Legend: Other, 256 x 256, 512 x 384, 512 x 512, 1024 x 1024, 2560 x 1504, 2048 x 2048]

[Chart: Total Texture Memory. Legend: Uncompressed, ETC1, ASTC 5x5. Label: 944 MB]

Textures

Name         Size         Format            Type
Texture 45   2048 x 2048  GL_RGBA           GL_UNSIGNED_BYTE
Texture 241  2048 x 2048  GL_ETC1_RGB8_OES
Texture 243  2048 x 2048  GL_ETC1_RGB8_OES
Texture 246  2048 x 2048  GL_ETC1_RGB8_OES
Texture 259  2048 x 2048  GL_ETC1_RGB8_OES
Texture 263  2048 x 2048  GL_ETC1_RGB8_OES
Texture 267  2048 x 2048  GL_ETC1_RGB8_OES
Texture 268  2048 x 2048  GL_ETC1_RGB8_OES
Texture 270  2048 x 2048  GL_ETC1_RGB8_OES
Texture 275  2048 x 2048  GL_ETC1_RGB8_OES
Transaction Elimination

Helps reduce bandwidth consumption 

This technology prevents the game from 
wasting bandwidth while still utilizing GPU 
resources to render tiles that haven’t 
changed from previous frames. 

■ Every time the GPU resolves a tile-full of 
color samples, it computes a signature 

■ Each signature is written into a list
associated with the output color buffer

■ The next time it renders to that buffer, if 
the signature hasn't changed, it skips 
writing out the tile 
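The three steps above can be modelled in a few lines. This is a toy software simulation, not the hardware implementation: the signature function (CRC32 here), the 16 x 16 RGBA tile size, and all class and function names are assumptions made for illustration.

```python
import zlib

TILE_BYTES = 16 * 16 * 4  # assumed: one 16x16 tile of 4-byte RGBA pixels


class ColorBuffer:
    """Toy model of an output colour buffer with a per-tile signature list."""

    def __init__(self):
        self.signatures = {}    # tile (x, y) -> signature of last written data
        self.memory = {}        # tile (x, y) -> pixel bytes held in memory
        self.tiles_written = 0
        self.tiles_killed = 0

    def resolve_tile(self, tile_xy, pixels):
        """Write the tile out unless its signature matches the stored one."""
        signature = zlib.crc32(pixels)  # stand-in for the real signature
        if self.signatures.get(tile_xy) == signature:
            self.tiles_killed += 1      # unchanged: skip the memory write
            return False
        self.signatures[tile_xy] = signature
        self.memory[tile_xy] = pixels
        self.tiles_written += 1
        return True


def render_frame(buffer, frame):
    """Resolve every tile of a frame given as {tile_xy: pixel bytes}."""
    for tile_xy, pixels in frame.items():
        buffer.resolve_tile(tile_xy, pixels)


def solid_tile(value):
    """Build a tile in which every byte has the same value."""
    return bytes([value]) * TILE_BYTES
```

Rendering the same four-tile frame twice, with one tile altered in the second pass, leaves `tiles_written` at 5 and `tiles_killed` at 3: only the changed tile costs a second write.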

More about Transaction Elimination here: 
http://community.arm.com/groups/arm-mali-graphics/blog/2012/08/17/how-low-can-you-go-building-low-power-low-bandwidth-arm-mali-gpus
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Transaction Elimination 


[Charts: Mali Fragment Tasks for two cases, "Camera moving in the scene" and "Loading screen". Each plots Tiles rendered against Tile writes killed by TE, with a summary chart of Tiles Written versus Tiles Killed by Transaction Elimination.]




Vertex Buffer Objects 


Using Vertex Buffer Objects (VBOs) can
save you a lot of per-frame overhead

■ Without VBOs, all of your vertex and
colour data is sent to the GPU every frame

■ Much of this data doesn't change between
frames, so there is no need to keep sending it

■ It is much better to cache the data in
graphics memory
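A rough estimate shows the scale of the saving. The sketch below is plain arithmetic under stated assumptions: static geometry, float attributes of 4 bytes each, and a hypothetical mesh size and frame count chosen only for illustration. It does not model driver behaviour.

```python
FLOAT_BYTES = 4  # size of one GL_FLOAT component


def vertex_stride_bytes(attribute_sizes):
    """Bytes per vertex for a list of float attribute component counts."""
    return sum(attribute_sizes) * FLOAT_BYTES


def client_array_bytes(vertex_count, stride, frames):
    """Data sent when vertex arrays are re-submitted from client memory each frame."""
    return vertex_count * stride * frames


def vbo_bytes(vertex_count, stride):
    """Data sent when static geometry is uploaded once into a VBO."""
    return vertex_count * stride
```

With three attributes of 2, 4 and 2 floats, the stride is 32 bytes. For a hypothetical 10,000-vertex mesh over 60 frames, `client_array_bytes(10000, 32, 60)` is 19,200,000 bytes, against a one-off 320,000 bytes for `vbo_bytes(10000, 32)`.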
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monopoly-mgdl3.mgd £3 


#        Error         Return      Function Call
376441   GL_NO_ERROR               glEnable(cap=GL_CULL_FACE)
376442   GL_NO_ERROR               glCullFace(mode=GL_BACK)
376443   GL_NO_ERROR               glUseProgram(program=38)
376444   GL_NO_ERROR               glEnableVertexAttribArray(index=0)
376445   GL_NO_ERROR               glEnableVertexAttribArray(index=1)
376446   GL_NO_ERROR               glEnableVertexAttribArray(index=2)
376447   GL_NO_ERROR               glVertexAttribPointer(indx=0, size=2, type=GL_FLOAT, normalized=GL_F...
376448   GL_NO_ERROR               glVertexAttribPointer(indx=1, size=4, type=GL_FLOAT, normalized=GL_F...
376449   GL_NO_ERROR               glVertexAttribPointer(indx=2, size=2, type=GL_FLOAT, normalized=GL_F...
376450   GL_NO_ERROR               glUniformMatrix4fv(location=2, count=1, transpose=GL_FALSE, value=...
376451   GL_NO_ERROR               glUniformMatrix4fv(location=3, count=1, transpose=GL_FALSE, value=...
376452   GL_NO_ERROR               glDisable(cap=GL_SCISSOR_TEST)
376453   GL_NO_ERROR               glScissor(x=0, y=0, width=2464, height=1504)
376454   GL_NO_ERROR               glClearColor(red=0.0, green=0.0, blue=0.0, alpha=0.0)
376455   GL_NO_ERROR               glDisable(cap=GL_CULL_FACE)
376456   GL_NO_ERROR               glClearColor(red=0.0, green=0.0, blue=0.0, alpha=0.0)
376457   GL_NO_ERROR               glDisable(cap=GL_CULL_FACE)
376458   GL_NO_ERROR               glClearColor(red=0.0, green=0.0, blue=0.0, alpha=0.0)
376459   GL_NO_ERROR               glDisable(cap=GL_CULL_FACE)
376460   GL_NO_ERROR               glDisable(cap=GL_BLEND)
376461   EGL_SUCCESS   EGL_SU...   eglGetError()
376462   EGL_SUCCESS   EGL_TRUE    eglSwapBuffers(dpy=0x71395be8, surface=0x78c9ef50)
376463   GL_NO_ERROR               glDisable(cap=GL_BLEND)
376464   GL_NO_ERROR   GL_NO_...   glGetError()
376465   GL_NO_ERROR               glDepthMask(flag=GL_TRUE)
376466   GL_NO_ERROR   GL_NO_...   glGetError()
376467   GL_NO_ERROR   GL_NO_...   glGetError()
376468   GL_NO_ERROR               glBindTexture(target=GL_TEXTURE_2D, texture=0)
376469   GL_NO_ERROR   GL_NO_...   glGetError()
376470   GL_NO_ERROR   GL_NO_...   glGetError()
376471   GL_NO_ERROR               glUniformMatrix4fv(location=2, count=1, transpose=GL_FALSE, value=...
376472   GL_NO_ERROR   GL_NO_...   glGetError()
376473   GL_NO_ERROR               glClear(mask=<0x4100>)
376474   GL_NO_ERROR   GL_NO_...   glGetError()
376475   GL_NO_ERROR   GL_NO_...   glGetError()
376476   GL_NO_ERROR               glBindBuffer(target=GL_ARRAY_BUFFER, buffer=3)
376477   GL_NO_ERROR   GL_NO_...   glGetError()

Trace Analysis

Message
Offset is beyond the end of the buffer
Vertex attribute 'pointer' less than zero
Detected 8 errors returned from functions.
Indices buffer may be too sparse (total sparseness: NaN)
Vertex attrib. array is enabled but no data is available
Detected 17140 calls to glVertexAttribPointer without a bound vertex buffer object
100.00% of the draw calls are using GL_TRIANGLES
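The check behind the message about glVertexAttribPointer calls without a bound vertex buffer object can be approximated on an exported list of call strings. This is a minimal sketch, not how the Mali Graphics Debugger implements it: the input format (one call per string, written as in the trace view) and the function name are assumptions.

```python
import re

# Matches array-buffer binds as they appear in the trace view.
BIND_RE = re.compile(r"glBindBuffer\(target=GL_ARRAY_BUFFER, buffer=(\d+)\)")


def count_unbound_attrib_pointer_calls(calls):
    """Count glVertexAttribPointer calls issued while no VBO is bound.

    Buffer 0 means "no buffer bound", which is also the initial state.
    """
    bound_buffer = 0
    unbound_calls = 0
    for call in calls:
        match = BIND_RE.match(call)
        if match:
            bound_buffer = int(match.group(1))
        elif call.startswith("glVertexAttribPointer(") and bound_buffer == 0:
            unbound_calls += 1
    return unbound_calls
```

A non-zero result suggests that vertex data is being sourced from client memory, so those draw calls are candidates for conversion to VBOs.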






Summary 


Covered today:

■ Introduction to performance analysis

■ Software Profiling

■ GPU Profiling

■ Debugging with the ARM® Mali™
Graphics Debugger
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more information: 


www.malideveloper.arm.com 

www.ds.arm.com 


www.community.arm.com 




Thank You 


Any Questions? 


The trademarks featured in this presentation are registered and/or unregistered trademarks of ARM Limited (or its subsidiaries) in the EU and/or elsewhere. All rights reserved. Any other marks featured may be trademarks of their respective owners.
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The Architecture for the Digital World 
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